A Convergence Theorem for the Graph Shift-type Algorithms
Graph Shift (GS) algorithms have recently attracted attention as a promising approach for
discovering dense subgraphs in noisy data. However, there are no theoretical
foundations for proving the convergence of the GS Algorithm. In this paper, we
propose a generic theoretical framework consisting of three key GS components:
simplex of generated sequence set, monotonic and continuous objective function
and closed mapping. We prove that GS algorithms with such components can be
transformed to fit Zangwill's convergence theorem, and the sequence set
generated by the GS procedures always terminates at a local maximum, or at
worst, contains a subsequence which converges to a local maximum of the
similarity measure function. The framework is verified by extending it to other
GS-type algorithms and by experimental results
In-Depth Behavior Understanding and Use: The Behavior Informatics Approach
The in-depth analysis of human behavior has been increasingly recognized as a
crucial means for disclosing interior driving forces, causes and impact on
businesses in handling many challenging issues. The modeling and analysis of
behaviors in virtual organizations is an open area. Traditional behavior
modeling mainly relies on qualitative methods from behavioral science and
social science perspectives. The so-called behavior analysis is actually based
on human demographic and business usage data, where behavior-oriented elements
are hidden in routinely collected transactional data. As a result, it is
ineffective or even impossible to deeply scrutinize native behavior intention,
lifecycle and impact on complex problems and business issues. We propose the
approach of Behavior Informatics (BI), in order to support explicit and
quantitative behavior involvement through a conversion from source data to
behavioral data, and further conduct genuine analysis of behavior patterns and
impacts. BI consists of key components including behavior representation,
behavioral data construction, behavior impact analysis, behavior pattern
analysis, behavior simulation, and behavior presentation and behavior use. We
discuss the concepts of behavior and an abstract behavioral model, as well as
the research tasks, process and theoretical underpinnings of BI. Substantial
experiments have shown that BI has the potential to greatly complement the
existing empirical and specific means by finding deeper and more informative
patterns leading to greater in-depth behavior understanding. BI creates new
directions and means to enhance the quantitative, formal and systematic
modeling and analysis of behaviors in both physical and virtual organizations
Asymptotic power of likelihood ratio tests for high dimensional data
This paper considers the asymptotic power of the likelihood ratio test (LRT) for
the identity test when the dimension p is large compared to the sample size n.
The asymptotic distribution of LRT under alternatives is given and an explicit
expression of the power is derived. A simulation study is carried out to
compare LRT with other tests. All these studies show that LRT is a powerful
test to detect eigenvalues around zero.
Key words and phrases: Covariance matrix, High dimensional data, Identity
test, Likelihood ratio test, Power
Comment: 10 pages, 2 figures
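As a rough illustration of the identity test discussed in this abstract, the sketch below computes the textbook form of the LRT statistic for H0: Sigma = I, namely tr(S) - log det(S) - p, where S is the sample covariance. This form, the function names and the simulated data are assumptions made for illustration; the abstract does not give the paper's exact scaling or normalization, and the sketch needs n > p so that S is nonsingular.

```python
import math
import random


def sample_covariance(rows):
    """Unbiased sample covariance matrix of n observations (rows) in p dimensions."""
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    cov = [[0.0] * p for _ in range(p)]
    for r in rows:
        d = [r[j] - means[j] for j in range(p)]
        for i in range(p):
            for j in range(p):
                cov[i][j] += d[i] * d[j]
    return [[c / (n - 1) for c in row] for row in cov]


def log_det_spd(a):
    """Log-determinant of a symmetric positive definite matrix via Cholesky."""
    p = len(a)
    chol = [[0.0] * p for _ in range(p)]
    for i in range(p):
        for j in range(i + 1):
            s = a[i][j] - sum(chol[i][k] * chol[j][k] for k in range(j))
            if i == j:
                if s <= 0.0:
                    raise ValueError("matrix is not positive definite")
                chol[i][j] = math.sqrt(s)
            else:
                chol[i][j] = s / chol[j][j]
    return 2.0 * sum(math.log(chol[i][i]) for i in range(p))


def lrt_identity_statistic(cov):
    """tr(S) - log det(S) - p: zero when S = I, positive otherwise (unscaled)."""
    p = len(cov)
    trace = sum(cov[i][i] for i in range(p))
    return trace - log_det_spd(cov) - p


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical data: n = 200 standard normal observations in p = 5 dimensions.
    data = [[rng.gauss(0.0, 1.0) for _ in range(5)] for _ in range(200)]
    print(lrt_identity_statistic(sample_covariance(data)))
```

In terms of the eigenvalues of S, the statistic is a sum of terms x - log(x) - 1, each of which grows without bound as x approaches zero. This is consistent with the abstract's remark that the test is sensitive to eigenvalues around zero.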
Non-parametric Power-law Data Clustering
It has always been a great challenge for clustering algorithms to
automatically determine the cluster numbers according to the distribution of
datasets. Several approaches have been proposed to address this issue,
including the recent promising work which incorporates Bayesian nonparametrics
into the k-means clustering procedure. This approach shows simplicity in
implementation and solidity in theory, while it also provides a feasible way to
inference in large scale datasets. However, several problems remain unsolved
in this pioneering work, including applicability to power-law data, a mechanism
for merging centers to avoid the over-fitting problem, the clustering order
problem, etc. To address these issues, the Pitman-Yor Process based k-means (namely
pyp-means) is proposed in this paper. Taking advantage of the Pitman-Yor
Process, pyp-means treats clusters differently by dynamically and
adaptively changing the threshold to guarantee the generation of power-law
clustering results. Also, a center agglomeration procedure is integrated into
the implementation to merge small but close clusters and then
adaptively determine the cluster number. With more discussion on the clustering
order, the convergence proof, complexity analysis and extension to spectral
clustering, our approach is compared with traditional clustering algorithms and
variational inference methods. The advantages and properties of pyp-means are
validated by experiments on both synthetic datasets and real world datasets
Data Science: Nature and Pitfalls
Data science is creating very exciting trends as well as significant
controversy. A critical matter for the healthy development of data science in
its early stages is to deeply understand the nature of data and data science,
and to discuss the various pitfalls. These important issues motivate the
discussions in this article
Data Science: Challenges and Directions
While data science has emerged as a contentious new scientific field, there
has been enormous debate and discussion about why we need data science
and what makes it a science. In reviewing hundreds of pieces of literature
which include data science in their titles, we find that the majority of the
discussions essentially concern statistics, data mining, machine learning, big
data, or broadly data analytics, and only a limited number of new data-driven
challenges and directions have been explored. In this paper, we explore the
intrinsic challenges and directions inspired by comprehensively exploring the
complexities and intelligence embedded in data science problems. We focus on
the research and innovation challenges inspired by the nature of data science
problems as complex systems, and the methodologies for handling such systems
Coupling Learning of Complex Interactions
Complex applications such as big data analytics involve different forms of
coupling relationships that reflect interactions between factors related to
technical, business (domain-specific) and environmental (including
socio-cultural and economic) aspects. There are diverse forms of couplings
embedded in poorly structured and ill-structured data. Such couplings are
ubiquitous, implicit and/or explicit, objective and/or subjective,
heterogeneous and/or homogeneous, presenting complexities to existing learning
systems in statistics, mathematics and computer sciences, such as typical
dependency, association and correlation relationships. Modeling and learning
such couplings thus is fundamental but challenging. This paper discusses the
concept of coupling learning, focusing on the involvement of coupling
relationships in learning systems. Coupling learning has great potential for
building a deep understanding of the essence of business problems and handling
challenges that have not been addressed well by existing learning theories and
tools. This argument is verified by several case studies on coupling learning,
including handling coupling in recommender systems, incorporating couplings
into coupled clustering, coupling document clustering, coupled recommender
algorithms and coupled behavior analysis for groups
Data Science: A Comprehensive Overview
The twenty-first century has ushered in the age of big data and data economy,
in which data DNA, which carries important knowledge, insights and potential,
has become an intrinsic constituent of all data-based organisms. An appropriate
understanding of data DNA and its organisms relies on the new field of data
science and its keystone, analytics. Although it is widely debated whether big
data is only hype and buzz, and data science is still in a very early phase,
significant challenges and opportunities are emerging or have been inspired by
the research, innovation, business, profession, and education of data science.
This paper provides a comprehensive survey and tutorial of the fundamental
aspects of data science: the evolution from data analysis to data science, the
data science concepts, a big picture of the era of data science, the major
challenges and directions in data innovation, the nature of data analytics, new
industrialization and service opportunities in the data economy, the profession
and competency of data education, and the future of data science. This article
is the first in the field to draw a comprehensive big picture, in addition to
offering rich observations, lessons and thinking about data science and
analytics
Non-IID Recommender Systems: A Review and Framework of Recommendation Paradigm Shifting
While recommendation plays an increasingly critical role in our living,
study, work, and entertainment, the recommendations we receive are often for
irrelevant, duplicate, or uninteresting products and services. A critical
reason for such bad recommendations lies in the intrinsic assumption that
recommended users and items are independent and identically distributed (IID)
in existing theories and systems. Another phenomenon is that, while tremendous
efforts have been made to model specific aspects of users or items, the overall
user and item characteristics and their non-IIDness have been overlooked. In
this paper, the non-IID nature and characteristics of recommendation are
discussed, followed by the non-IID theoretical framework in order to build a
deep and comprehensive understanding of the intrinsic nature of recommendation
problems, from the perspective of both couplings and heterogeneity. This
non-IID recommendation research triggers the paradigm shift from IID to non-IID
recommendation research and can hopefully deliver informed, relevant,
personalized, and actionable recommendations. It creates exciting new
directions and fundamental solutions to address various complexities including
cold-start, sparse data-based, cross-domain, group-based, and shilling
attack-related issues
Characterizing A Database of Sequential Behaviors with Latent Dirichlet Hidden Markov Models
This paper proposes a generative model, the latent Dirichlet hidden Markov
model (LDHMM), for characterizing a database of sequential behaviors
(sequences). LDHMMs posit that each sequence is generated by an underlying
Markov chain process, which is controlled by the corresponding parameters
(i.e., the initial state vector, transition matrix and the emission matrix).
These sequence-level latent parameters for each sequence are modeled as latent
Dirichlet random variables and parameterized by a set of deterministic
database-level hyper-parameters. In this way, we expect to model the
sequences at two levels: the database level by deterministic hyper-parameters
and the sequence-level by latent parameters. To learn the deterministic
hyper-parameters and approximate posteriors of parameters in LDHMMs, we propose
an iterative algorithm under the variational EM framework, which consists of E
and M steps. We examine two different schemes, the fully-factorized and
partially-factorized forms, for the framework, based on different assumptions.
We present empirical results of behavior modeling and sequence classification
on three real-world data sets, and compare them to other related models. The
experimental results show that the proposed LDHMMs produce better
generalization performance in terms of log-likelihood and deliver competitive
results on the sequence classification problem
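The two-level generative process described in this abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes discrete emissions, an independent Dirichlet prior on the initial state vector and on each row of the transition and emission matrices, and hypothetical names (`generate_sequence`, `init_alpha`, `trans_alpha`, `emit_alpha`) that do not come from the paper. The variational EM learning procedure is not shown.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw a probability vector from Dirichlet(alpha) via normalized gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def sample_categorical(probs, rng):
    """Draw an index according to the given probability vector."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def generate_sequence(length, init_alpha, trans_alpha, emit_alpha, rng):
    """Generate one observation sequence.

    Database level: init_alpha, trans_alpha, emit_alpha are fixed
    hyper-parameters shared by all sequences (assumed Dirichlet parameters).
    Sequence level: the initial vector, transition matrix and emission
    matrix are drawn afresh for this sequence.
    """
    initial = sample_dirichlet(init_alpha, rng)
    transition = [sample_dirichlet(row, rng) for row in trans_alpha]
    emission = [sample_dirichlet(row, rng) for row in emit_alpha]

    observations = []
    state = sample_categorical(initial, rng)
    for step in range(length):
        observations.append(sample_categorical(emission[state], rng))
        if step < length - 1:
            state = sample_categorical(transition[state], rng)
    return observations


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical setup: 2 hidden states, 3 observation symbols.
    init_alpha = [1.0, 1.0]
    trans_alpha = [[1.0, 1.0], [1.0, 1.0]]
    emit_alpha = [[1.0, 1.0, 1.0], [1.0, 1.0, 1.0]]
    print(generate_sequence(10, init_alpha, trans_alpha, emit_alpha, rng))
```

Because every sequence draws its own parameters from the shared hyper-parameters, sequences in the same database can differ in their dynamics while still being tied together at the database level.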